Genre classification for a corpus of academic webpages

Authors

  • Erika Dalan
  • Serge Sharoff
Abstract

In this paper we report our analysis of the similarities between webpages crawled from European academic websites, and compare their distribution in terms of English language variety (native English vs English as a lingua franca) and language family (based on the country’s official language). After building a corpus of university webpages, we selected a set of relevant descriptors that can represent their text types using the framework of Functional Text Dimensions. Manual annotation of a random sample of academic pages provides the basis for classifying the remaining texts on each dimension. Reliable thresholds are then determined in order to evaluate precision and assess the distribution of text types on each dimension, with the ultimate goal of analysing language features across English varieties and language families.
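
The abstract does not say how the thresholds are determined. As a rough illustration only, the sketch below picks, for a single dimension, the lowest classifier score at which precision on the manually annotated sample still meets a chosen target. The function name `pick_threshold`, the 0.9 default target, and the toy scores are assumptions made for this example, not details taken from the paper.

```python
def pick_threshold(scores, gold, target_precision=0.9):
    """Return the lowest score cut-off whose precision meets the target.

    scores: classifier scores for one dimension (higher = more confident)
    gold:   manual annotations for the same pages (True = page has the dimension)
    Returns None if no cut-off reaches the target precision.
    """
    # Rank the annotated pages from most to least confident.
    pairs = sorted(zip(scores, gold), reverse=True)
    best = None
    true_pos = 0
    for rank, (score, is_positive) in enumerate(pairs, start=1):
        true_pos += int(is_positive)
        # Evaluate only at the last page of a run of tied scores, because a
        # cut-off cannot separate pages that share the same score.
        if rank < len(pairs) and pairs[rank][0] == score:
            continue
        # Precision if every page scoring >= this score is accepted.
        if true_pos / rank >= target_precision:
            best = score
    return best


# Toy data (hypothetical): five annotated pages for one dimension.
sample_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
sample_gold = [True, True, False, True, False]
threshold = pick_threshold(sample_scores, sample_gold, target_precision=0.75)
```

Pages in the unannotated remainder of the corpus whose score is at or above `threshold` would then be counted as having the dimension. Because precision is not monotonic in the cut-off, the loop checks every distinct score rather than stopping at the first failure.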

Similar resources

The Prestigious World University on its Homepage: The Promotional Academic Genre of Overview

In response to competitive demands to establish their international academic and financial credentials, universities worldwide publish online introductory information about themselves. To this end, university homepages have in recent years increasingly turned into a rhetorical space for the development of promotional academic texts. In this study, we examined university...

Promotion of Self in an Other-Oriented Academic Sub-Genre: The Case of Self-Mention in Acknowledgments

Although sometimes considered to act only as a means of recognizing debts, acknowledgments give writers the opportunity to display a self-conscious and reflective representation of self. Following this assumption and to reveal some of the ways this is achieved, a corpus of 80 textbook acknowledgments in the field of Linguistics and Applied Linguistics was analyzed in order to show what “se...

Hedges in English for Academic Purposes: A Corpus-based study of Iranian EFL learners

Hedges, as tools to express tentativeness and doubt, have been studied in plenty of research papers in the Iranian EFL research setting. However, their use in a learner corpus, portraying Iranian learner English, is in need of more research attention. With this end in view, this study aimed at investigating how Iranian EFL learners who have majored in English-related fields in Iran deployed hed...

The Web Library of Babel: evaluating genre collections

We present experiments in automatic genre classification on web corpora, comparing a wide variety of features on several different genre-annotated datasets (HGC, I-EN, KI-04, KRYS-I, MGC and SANTINIS). We investigate the performance of several types of features (POS n-grams, character n-grams and word n-grams) and show that simple character n-grams perform best on current collections because of ...

Learning to recognize webpage genres

Webpages are mainly distinguished by their topic (e.g., politics, sports etc.) and genre (e.g., blogs, homepages, e-shops, etc.). Automatic detection of webpage genre could considerably enhance the ability of modern search engines to focus on the requirements of the user’s information need. In this paper, we present an approach to webpage genre detection based on a fully-automated extraction of...

Publication date: 2016